Comparing a Linguistic and a Stochastic Tagger

Abstract

Concerning different approaches to automatic PoS tagging: EngCG-2, a constraint-based morphological tagger, is compared in a double-blind test with a state-of-the-art statistical tagger on a common disambiguation task using a common tag set. The experiments show that for the same amount of remaining ambiguity, the error rate of the statistical tagger is one order of magnitude greater than that of the rule-based one. The two related issues of priming effects compromising the results and disagreement between human annotators are also addressed.

1 Introduction

There are currently two main methods for automatic part-of-speech tagging. The prevailing one uses essentially statistical language models automatically derived from usually hand-annotated corpora. These corpus-based models can be represented e.g. as collocational matrices (Garside et al. (eds.) 1987; Church 1988), Hidden Markov models (cf. Cutting et al. 1992), local rules (e.g. Hindle 1989) and neural networks (e.g. Schmid 1994). Taggers using these statistical language models are generally reported to assign the correct and unique tag to 95-97% of words in running text, using tag sets ranging from some dozens to about 130 tags.

The less popular approach is based on hand-coded linguistic rules. Pioneering work was done in the 1960s (e.g. Greene and Rubin 1971). Recently, new interest in the linguistic approach has been shown e.g. in the work of (Karlsson 1990; Voutilainen et al. 1992; Oflazer and Kuruöz 1994; Chanod and Tapanainen 1995; Karlsson et al. (eds.) 1995; Voutilainen 1995).

The first serious linguistic competitor to data-driven statistical taggers is the English Constraint Grammar parser, EngCG (cf. Voutilainen et al. 1992; Karlsson et al. (eds.) 1995). The tagger consists of the following sequentially applied modules:

1. Tokenisation
2. Morphological analysis
   (a) Lexical component
   (b) Rule-based guesser for unknown words
3. Resolution of morphological ambiguities

The tagger uses a two-level morphological analyser with a large lexicon and a morphological description that introduces about 180 different ambiguity-forming morphological analyses, as a result of which each word gets 1.7-2.2 different analyses on average. Morphological analyses are assigned to unknown words with an accurate rule-based 'guesser'. The morphological disambiguator uses constraint rules that discard illegitimate morphological analyses on the basis of local or global context conditions. The rules can be grouped as ordered subgrammars: e.g. heuristic subgrammar 2 can be applied for resolving ambiguities left pending by the more 'careful' subgrammar 1.

Older versions of EngCG (using about 1,150 constraints) are reported (Voutilainen et al. 1992; Voutilainen and Heikkilä 1994; Tapanainen and Voutilainen 1994; Voutilainen 1995) to assign a correct analysis to about 99.7% of all words while each word in the output retains 1.04-1.09 alternative analyses on average, i.e. some of the ambiguities remain unresolved.

These results have been seriously questioned. One doubt concerns the notion 'correct analysis'. For example, Church (1992) argues that linguists who manually perform the tagging task using the double-blind method disagree about the correct analysis in at least 3% of all words even after they have negotiated about the initial disagreements. If this were the case, reporting accuracies above this 97% 'upper bound' would make no sense. However, Voutilainen and Järvinen (1995) empirically show that an interjudge agreement of virtually 100% is possible, at least with the EngCG tag set if not with the original Brown Corpus tag set. This consistent applicability of the EngCG tag set is explained by characterising it as grammatically rather than semantically motivated.
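The reductionist disambiguation described in this introduction, and the two figures reported for EngCG (the share of words that retain a correct analysis, and the average number of analyses left per word), can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not EngCG's actual rule formalism: the tag names (DET, N, V), the single constraint, and all function names are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Set

# Each token carries a set of candidate tags (its remaining analyses).
Sentence = List[Set[str]]
# A constraint looks at position i in context and returns the tags to discard.
Constraint = Callable[[Sentence, int], Set[str]]


def no_verb_after_determiner(sent: Sentence, i: int) -> Set[str]:
    """Hypothetical rule: discard a verb reading right after an unambiguous determiner."""
    if i > 0 and sent[i - 1] == {"DET"}:
        return {"V"}
    return set()


def disambiguate(sent: Sentence, constraints: List[Constraint]) -> Sentence:
    """Apply constraints until nothing changes; never remove a token's last reading."""
    sent = [set(cohort) for cohort in sent]  # work on a copy
    changed = True
    while changed:
        changed = False
        for i in range(len(sent)):
            for rule in constraints:
                remaining = sent[i] - rule(sent, i)
                if remaining and remaining != sent[i]:
                    sent[i] = remaining
                    changed = True
    return sent


def analyses_per_word(sent: Sentence) -> float:
    """Average number of analyses retained per word (remaining ambiguity)."""
    return sum(len(cohort) for cohort in sent) / len(sent)


def correct_rate(sent: Sentence, gold: List[str]) -> float:
    """Share of words whose retained analyses still include the gold tag."""
    return sum(tag in cohort for cohort, tag in zip(sent, gold)) / len(gold)


# Toy input: output of a (hypothetical) morphological analyser for three tokens.
analysed = [{"DET"}, {"N", "V"}, {"N", "V"}]
gold = ["DET", "N", "V"]

output = disambiguate(analysed, [no_verb_after_determiner])
print(output, analyses_per_word(output), correct_rate(output, gold))
```

In this toy run the second token is resolved while the third keeps both readings, because no constraint applies to it. That mirrors the trade-off in the text: a constraint-based tagger may leave some ambiguity pending rather than risk discarding the correct analysis, so both figures have to be reported together.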

Publication date: 2002